{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true,
    "pycharm": {
     "name": "#%% md\n"
    }
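   },
   "source": [
    "## Bayesian Spelling Correction Example\n",
    "\n",
    "This code corrects misspelled input: given the word the user typed, it outputs the word the user most likely intended.\n",
    "P(c) is estimated from word frequencies in a corpus, and P(w|c) is approximated by the edit operations (insertion, deletion, transposition, substitution) that turn c into w; the final choice combines both factors.\n",
    "\n",
    "### Solve argmax_c P(c|w) = argmax_c P(w|c)P(c)/P(w)\n",
    " * P(c): the probability that the correctly spelled word c appears in text\n",
    " * P(w|c): the probability that the user types w when intending to type c\n",
    " * argmax: enumerates all candidate words c and selects the one with the highest probability"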
   ]
  },
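  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "The reduction stated in the heading above follows from Bayes' rule. As a short worked derivation (a sketch, where $w$ is the typed word, $c$ ranges over candidate corrections, and $\\hat{c}$ is a symbol introduced here for the chosen correction):\n",
    "\n",
    "$$\n",
    "\\hat{c} = \\arg\\max_c P(c \\mid w) = \\arg\\max_c \\frac{P(w \\mid c)\\,P(c)}{P(w)} = \\arg\\max_c P(w \\mid c)\\,P(c)\n",
    "$$\n",
    "\n",
    "The last step holds because $P(w)$ is the same for every candidate $c$, so dividing by it does not change which $c$ attains the maximum."
   ]
  },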
  {
   "cell_type": "code",
   "execution_count": 22,
   "outputs": [],
   "source": [
    "# re is the regular-expression module; re.findall returns a list of all non-overlapping matches in a string\n",
    "import re, collections\n",
    "\n",
    "# Tokenize the corpus: lowercase the text and keep only runs of letters\n",
    "def words(text):\n",
    "    return re.findall('[a-z]+', text.lower())  # text.lower() converts uppercase to lowercase; + matches one or more characters, so the result is a list of lowercase words\n",
    "\n",
    "def train(features):\n",
    "    model = collections.defaultdict(lambda: 1)  # word counts start at 1, because a count of 0 would give an unseen word a prior probability of 0\n",
    "    for f in features:\n",
    "        model[f] += 1\n",
    "    # print(model)\n",
    "    return model\n",
    "\n",
    "with open('/Users/edgar/Downloads/贝叶斯-拼写检查器/big.txt') as f:  # the complete big.txt corpus\n",
    "    NWORDS = train(words(f.read()))  # read the corpus and train to get the frequency of each word in it"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "outputs": [
    {
     "data": {
      "text/plain": "defaultdict(<function __main__.train.<locals>.<lambda>()>,\n            {'the': 80031,\n             'project': 289,\n             'gutenberg': 264,\n             'ebook': 88,\n             'of': 40026,\n             'adventures': 18,\n             'sherlock': 102,\n             'holmes': 468,\n             'by': 6739,\n             'sir': 178,\n             'arthur': 35,\n             'conan': 5,\n             'doyle': 6,\n             'in': 22048,\n             'our': 1067,\n             'series': 129,\n             'copyright': 70,\n             'laws': 234,\n             'are': 3631,\n             'changing': 45,\n             'all': 4145,\n             'over': 1283,\n             'world': 363,\n             'be': 6156,\n             'sure': 124,\n             'to': 28767,\n             'check': 39,\n             'for': 6940,\n             'your': 1280,\n             'country': 424,\n             'before': 1364,\n             'downloading': 6,\n             'or': 5353,\n             'redistributing': 8,\n             'this': 4064,\n             'any': 1205,\n             'other': 1503,\n             'header': 8,\n             'should': 1298,\n             'first': 1178,\n             'thing': 304,\n             'seen': 445,\n             'when': 2924,\n             'viewing': 8,\n             'file': 22,\n             'please': 173,\n             'do': 1504,\n             'not': 6626,\n             'remove': 54,\n             'it': 10682,\n             'change': 151,\n             'edit': 5,\n             'without': 1016,\n             'written': 118,\n             'permission': 53,\n             'read': 219,\n             'legal': 53,\n             'small': 528,\n             'print': 48,\n             'and': 38313,\n             'information': 74,\n             'about': 1498,\n             'at': 6792,\n             'bottom': 43,\n             'included': 44,\n             'is': 9775,\n             'important': 286,\n             
'specific': 38,\n             'rights': 169,\n             'restrictions': 24,\n             'how': 1316,\n             'may': 2552,\n             'used': 277,\n             'you': 5623,\n             'can': 1096,\n             'also': 779,\n             'find': 295,\n             'out': 1988,\n             'make': 505,\n             'a': 21156,\n             'donation': 11,\n             'get': 469,\n             'involved': 108,\n             'welcome': 19,\n             'free': 422,\n             'plain': 109,\n             'vanilla': 7,\n             'electronic': 59,\n             'texts': 8,\n             'ebooks': 55,\n             'readable': 14,\n             'both': 530,\n             'humans': 3,\n             'computers': 8,\n             'since': 261,\n             'these': 1232,\n             'were': 4290,\n             'prepared': 139,\n             'thousands': 94,\n             'volunteers': 23,\n             'title': 40,\n             'author': 30,\n             'release': 29,\n             'date': 49,\n             'march': 136,\n             'most': 909,\n             'recently': 31,\n             'updated': 5,\n             'november': 42,\n             'edition': 22,\n             'language': 62,\n             'english': 212,\n             'character': 175,\n             'set': 325,\n             'encoding': 6,\n             'ascii': 12,\n             'start': 68,\n             'additional': 31,\n             'editing': 7,\n             'jose': 2,\n             'menendez': 2,\n             'contents': 51,\n             'i': 7683,\n             'scandal': 20,\n             'bohemia': 16,\n             'ii': 78,\n             'red': 289,\n             'headed': 38,\n             'league': 54,\n             'iii': 92,\n             'case': 439,\n             'identity': 12,\n             'iv': 56,\n             'boscombe': 17,\n             'valley': 79,\n             'mystery': 40,\n             'v': 52,\n             'five': 280,\n             
'orange': 24,\n             'pips': 13,\n             'vi': 38,\n             'man': 1653,\n             'with': 9741,\n             'twisted': 22,\n             'lip': 57,\n             'vii': 35,\n             'adventure': 35,\n             'blue': 144,\n             'carbuncle': 18,\n             'viii': 40,\n             'speckled': 6,\n             'band': 55,\n             'ix': 29,\n             'engineer': 13,\n             's': 5632,\n             'thumb': 52,\n             'x': 137,\n             'noble': 49,\n             'bachelor': 19,\n             'xi': 29,\n             'beryl': 5,\n             'coronet': 30,\n             'xii': 29,\n             'copper': 27,\n             'beeches': 13,\n             'she': 3947,\n             'always': 609,\n             'woman': 326,\n             'have': 3494,\n             'seldom': 77,\n             'heard': 637,\n             'him': 5231,\n             'mention': 47,\n             'her': 5285,\n             'under': 964,\n             'name': 263,\n             'his': 10035,\n             'eyes': 940,\n             'eclipses': 3,\n             'predominates': 4,\n             'whole': 745,\n             'sex': 12,\n             'was': 11411,\n             'that': 12513,\n             'he': 12402,\n             'felt': 698,\n             'emotion': 37,\n             'akin': 15,\n             'love': 485,\n             'irene': 19,\n             'adler': 17,\n             'emotions': 11,\n             'one': 3372,\n             'particularly': 175,\n             'abhorrent': 2,\n             'cold': 258,\n             'precise': 14,\n             'but': 5654,\n             'admirably': 8,\n             'balanced': 7,\n             'mind': 342,\n             'take': 617,\n             'perfect': 40,\n             'reasoning': 42,\n             'observing': 22,\n             'machine': 40,\n             'has': 1604,\n             'as': 8065,\n             'lover': 27,\n             'would': 1954,\n             
'placed': 183,\n             'himself': 1159,\n             'false': 65,\n             'position': 433,\n             'never': 594,\n             'spoke': 219,\n             'softer': 11,\n             'passions': 30,\n             'save': 111,\n             'gibe': 3,\n             'sneer': 7,\n             'they': 3939,\n             'admirable': 15,\n             'things': 322,\n             'observer': 14,\n             'excellent': 63,\n             'drawing': 241,\n             'veil': 17,\n             'from': 5710,\n             'men': 1146,\n             'motives': 15,\n             'actions': 78,\n             'trained': 24,\n             'reasoner': 7,\n             'admit': 66,\n             'such': 1437,\n             'intrusions': 2,\n             'into': 2125,\n             'own': 786,\n             'delicate': 55,\n             'finely': 12,\n             'adjusted': 17,\n             'temperament': 6,\n             'introduce': 24,\n             'distracting': 2,\n             'factor': 42,\n             'which': 4843,\n             'might': 537,\n             'throw': 49,\n             'doubt': 153,\n             'upon': 1112,\n             'mental': 38,\n             'results': 230,\n             'grit': 2,\n             'sensitive': 36,\n             'instrument': 36,\n             'crack': 21,\n             'high': 291,\n             'power': 549,\n             'lenses': 2,\n             'more': 1998,\n             'disturbing': 10,\n             'than': 1207,\n             'strong': 169,\n             'nature': 171,\n             'yet': 489,\n             'there': 2973,\n             'late': 166,\n             'dubious': 2,\n             'questionable': 4,\n             'memory': 56,\n             'had': 7384,\n             'little': 1002,\n             'lately': 23,\n             'my': 2250,\n             'marriage': 97,\n             'drifted': 6,\n             'us': 685,\n             'away': 839,\n             'each': 412,\n             
'complete': 146,\n             'happiness': 144,\n             'home': 296,\n             'centred': 3,\n             'interests': 119,\n             'rise': 241,\n             'up': 2285,\n             'around': 272,\n             'who': 3051,\n             'finds': 24,\n             'master': 142,\n             'establishment': 41,\n             'sufficient': 76,\n             'absorb': 5,\n             'attention': 192,\n             'while': 769,\n             'loathed': 2,\n             'every': 651,\n             'form': 508,\n             'society': 170,\n             'bohemian': 9,\n             'soul': 169,\n             'remained': 232,\n             'lodgings': 12,\n             'baker': 50,\n             'street': 181,\n             'buried': 22,\n             'among': 452,\n             'old': 1181,\n             'books': 60,\n             'alternating': 3,\n             'week': 96,\n             'between': 655,\n             'cocaine': 5,\n             'ambition': 14,\n             'drowsiness': 5,\n             'drug': 22,\n             'fierce': 13,\n             'energy': 46,\n             'keen': 33,\n             'still': 923,\n             'ever': 275,\n             'deeply': 78,\n             'attracted': 37,\n             'study': 145,\n             'crime': 62,\n             'occupied': 117,\n             'immense': 78,\n             'faculties': 9,\n             'extraordinary': 75,\n             'powers': 150,\n             'observation': 40,\n             'following': 209,\n             'those': 1202,\n             'clues': 4,\n             'clearing': 30,\n             'mysteries': 10,\n             'been': 2600,\n             'abandoned': 73,\n             'hopeless': 18,\n             'official': 92,\n             'police': 95,\n             'time': 1530,\n             'some': 1537,\n             'vague': 40,\n             'account': 178,\n             'doings': 12,\n             'summons': 12,\n             'odessa': 4,\n             
'trepoff': 2,\n             'murder': 31,\n             'singular': 37,\n             'tragedy': 10,\n             'atkinson': 2,\n             'brothers': 51,\n             'trincomalee': 2,\n             'finally': 157,\n             'mission': 35,\n             'accomplished': 40,\n             'so': 3018,\n             'delicately': 4,\n             'successfully': 26,\n             'reigning': 4,\n             'family': 211,\n             'holland': 13,\n             'beyond': 226,\n             'signs': 99,\n             'activity': 132,\n             'however': 431,\n             'merely': 190,\n             'shared': 26,\n             'readers': 12,\n             'daily': 45,\n             'press': 82,\n             'knew': 497,\n             'former': 178,\n             'friend': 284,\n             'companion': 82,\n             'night': 386,\n             'on': 6644,\n             'twentieth': 20,\n             'returning': 69,\n             'journey': 70,\n             'patient': 384,\n             'now': 1698,\n             'returned': 195,\n             'civil': 178,\n             'practice': 96,\n             'way': 860,\n             'led': 197,\n             'me': 1921,\n             'through': 816,\n             'passed': 368,\n             'well': 1199,\n             'remembered': 121,\n             'door': 499,\n             'must': 956,\n             'associated': 197,\n             'wooing': 3,\n             'dark': 182,\n             'incidents': 15,\n             'scarlet': 23,\n             'seized': 115,\n             'desire': 97,\n             'see': 1102,\n             'again': 867,\n             'know': 1049,\n             'employing': 8,\n             'rooms': 87,\n             'brilliantly': 6,\n             'lit': 75,\n             'even': 947,\n             'looked': 761,\n             'saw': 600,\n             'tall': 75,\n             'spare': 28,\n             'figure': 104,\n             'pass': 155,\n             'twice': 85,\n 
            'silhouette': 2,\n             'against': 661,\n             'blind': 24,\n             'pacing': 27,\n             'room': 961,\n             'swiftly': 39,\n             'eagerly': 40,\n             'head': 726,\n             'sunk': 28,\n             'chest': 82,\n             'hands': 456,\n             'clasped': 12,\n             'behind': 402,\n             'mood': 52,\n             'habit': 56,\n             'attitude': 73,\n             'manner': 136,\n             'told': 491,\n             'their': 2956,\n             'story': 134,\n             'work': 383,\n             'risen': 31,\n             'created': 63,\n             'dreams': 17,\n             'hot': 120,\n             'scent': 18,\n             'new': 1212,\n             'problem': 77,\n             'rang': 30,\n             'bell': 66,\n             'shown': 114,\n             'chamber': 36,\n             'formerly': 78,\n             'part': 705,\n             'effusive': 3,\n             'glad': 151,\n             'think': 558,\n             'hardly': 174,\n             'word': 299,\n             'spoken': 93,\n             'kindly': 87,\n             'eye': 111,\n             'waved': 30,\n             'an': 3424,\n             'armchair': 50,\n             'threw': 97,\n             'across': 223,\n             'cigars': 8,\n             'indicated': 89,\n             'spirit': 168,\n             'gasogene': 2,\n             'corner': 129,\n             'then': 1559,\n             'stood': 384,\n             'fire': 275,\n             'introspective': 4,\n             'fashion': 50,\n             'wedlock': 2,\n             'suits': 9,\n             'remarked': 170,\n             'watson': 84,\n             'put': 436,\n             'seven': 133,\n             'half': 319,\n             'pounds': 27,\n             'answered': 227,\n             'indeed': 140,\n             'thought': 903,\n             'just': 768,\n             'trifle': 12,\n             'fancy': 51,\n      
       'observe': 38,\n             'did': 1876,\n             'tell': 493,\n             'intended': 59,\n             'go': 906,\n             'harness': 28,\n             'deduce': 15,\n             'getting': 93,\n             'yourself': 163,\n             'very': 1341,\n             'wet': 61,\n             'clumsy': 9,\n             'careless': 15,\n             'servant': 47,\n             'girl': 167,\n             'dear': 450,\n             'said': 3465,\n             'too': 549,\n             'much': 672,\n             'certainly': 120,\n             'burned': 78,\n             'lived': 114,\n             'few': 459,\n             'centuries': 13,\n             'ago': 109,\n             'true': 206,\n             'walk': 76,\n             'thursday': 8,\n             'came': 980,\n             'dreadful': 69,\n             'mess': 11,\n             'changed': 135,\n             'clothes': 63,\n             't': 1319,\n             'imagine': 97,\n             'mary': 706,\n             'jane': 3,\n             'incorrigible': 3,\n             'wife': 368,\n             'given': 365,\n             'notice': 99,\n             'fail': 41,\n             'chuckled': 8,\n             'rubbed': 33,\n             'long': 992,\n             'nervous': 55,\n             'together': 261,\n             'simplicity': 31,\n             'itself': 274,\n             'inside': 44,\n             'left': 835,\n             'shoe': 12,\n             'where': 978,\n             'firelight': 3,\n             'strikes': 20,\n             'leather': 36,\n             'scored': 5,\n             'six': 177,\n             'almost': 326,\n             'parallel': 18,\n             'cuts': 6,\n             'obviously': 39,\n             'caused': 103,\n             'someone': 161,\n             'carelessly': 15,\n             'scraped': 22,\n             'round': 557,\n             'edges': 71,\n             'sole': 71,\n             'order': 405,\n             'crusted': 3,\n       
      'mud': 37,\n             'hence': 33,\n             'double': 50,\n             'deduction': 13,\n             'vile': 17,\n             'weather': 43,\n             'malignant': 89,\n             'boot': 23,\n             'slitting': 3,\n             'specimen': 15,\n             'london': 77,\n             'slavey': 2,\n             'if': 2373,\n             'gentleman': 100,\n             'walks': 11,\n             'smelling': 6,\n             'iodoform': 44,\n             'black': 236,\n             'mark': 39,\n             'nitrate': 8,\n             'silver': 129,\n             'right': 711,\n             'forefinger': 8,\n             'bulge': 3,\n             'side': 512,\n             'top': 43,\n             'hat': 106,\n             'show': 214,\n             'secreted': 3,\n             'stethoscope': 3,\n             'dull': 75,\n             'pronounce': 10,\n             'active': 97,\n             'member': 51,\n             'medical': 23,\n             'profession': 23,\n             'could': 1701,\n             'help': 231,\n             'laughing': 116,\n             'ease': 45,\n             'explained': 61,\n             'process': 220,\n             'hear': 184,\n             'give': 524,\n             'reasons': 65,\n             'appears': 109,\n             'ridiculously': 2,\n             'simple': 140,\n             'easily': 115,\n             'myself': 228,\n             'though': 651,\n             'successive': 18,\n             'instance': 51,\n             'am': 747,\n             'baffled': 9,\n             'until': 326,\n             'explain': 124,\n             'believe': 184,\n             'good': 745,\n             'yours': 47,\n             'quite': 503,\n             'lighting': 17,\n             'cigarette': 7,\n             'throwing': 47,\n             'down': 1129,\n             'distinction': 20,\n             'clear': 234,\n             'example': 287,\n             'frequently': 219,\n             'steps': 
189,\n             'lead': 138,\n             'hall': 84,\n             'often': 444,\n             'hundreds': 49,\n             'times': 237,\n             'many': 610,\n             'don': 582,\n             'observed': 132,\n             'point': 224,\n             'seventeen': 11,\n             'because': 631,\n             'interested': 66,\n             'problems': 79,\n             'enough': 176,\n             'chronicle': 8,\n             'two': 1139,\n             'trifling': 13,\n             'experiences': 12,\n             'sheet': 30,\n             'thick': 78,\n             'pink': 28,\n             'tinted': 10,\n             'notepaper': 3,\n             'lying': 119,\n             'open': 326,\n             'table': 297,\n             'last': 566,\n             'post': 118,\n             'aloud': 29,\n             'note': 116,\n             'undated': 2,\n             'either': 294,\n             'signature': 10,\n             'address': 77,\n             'will': 1578,\n             'call': 198,\n             'quarter': 47,\n             'eight': 129,\n             'o': 258,\n             'clock': 121,\n             'desires': 23,\n             'consult': 20,\n             'matter': 366,\n             'deepest': 16,\n             'moment': 488,\n             'recent': 55,\n             'services': 39,\n             'royal': 112,\n             'houses': 118,\n             'europe': 154,\n             'safely': 12,\n             'trusted': 17,\n             'matters': 137,\n             'importance': 118,\n             'exaggerated': 29,\n             'we': 1907,\n             'quarters': 73,\n             'received': 281,\n             'hour': 158,\n             'amiss': 7,\n             'visitor': 75,\n             'wear': 31,\n             'mask': 13,\n             'what': 3012,\n             'means': 254,\n             'no': 2349,\n             'data': 18,\n             'capital': 145,\n             'mistake': 40,\n             'theorise': 2,\n  
           'insensibly': 3,\n             'begins': 48,\n             'twist': 15,\n             'facts': 73,\n             'suit': 26,\n             'theories': 22,\n             'instead': 138,\n             'carefully': 73,\n             'examined': 50,\n             'writing': 70,\n             'paper': 178,\n             'wrote': 150,\n             'presumably': 9,\n             'endeavouring': 9,\n             'imitate': 8,\n             'processes': 36,\n             'bought': 56,\n             'crown': 62,\n             'packet': 12,\n             'peculiarly': 15,\n             'stiff': 21,\n             'peculiar': 85,\n             'hold': 115,\n             'light': 279,\n             'large': 484,\n             'e': 137,\n             'g': 56,\n             'p': 67,\n             'woven': 6,\n             'texture': 7,\n             'asked': 778,\n             'maker': 5,\n             'monogram': 5,\n             'rather': 220,\n             'stands': 20,\n             'gesellschaft': 2,\n             'german': 197,\n             'company': 193,\n             'customary': 20,\n             'contraction': 62,\n             'like': 1081,\n             'co': 31,\n             'course': 390,\n             'papier': 2,\n             'eg': 2,\n             'let': 507,\n             'glance': 92,\n             'continental': 47,\n             'gazetteer': 2,\n             'took': 574,\n             'heavy': 140,\n             'brown': 72,\n             'volume': 31,\n             'shelves': 4,\n             'eglow': 2,\n             'eglonitz': 2,\n             'here': 692,\n             'egria': 2,\n             'speaking': 186,\n             'far': 409,\n             'carlsbad': 2,\n             'remarkable': 78,\n             'being': 919,\n             'scene': 50,\n             'death': 331,\n             'wallenstein': 2,\n             'its': 1636,\n             'numerous': 51,\n             'glass': 117,\n             'factories': 30,\n             
'mills': 40,\n             'ha': 76,\n             'boy': 170,\n             'sparkled': 6,\n             'sent': 320,\n             'great': 793,\n             'triumphant': 17,\n             'cloud': 31,\n             'made': 1008,\n             'precisely': 25,\n             'construction': 26,\n             'sentence': 27,\n             'frenchman': 103,\n             'russian': 462,\n             'uncourteous': 2,\n             'verbs': 2,\n             'only': 1874,\n             'remains': 74,\n             'therefore': 187,\n             'discover': 29,\n             'wanted': 214,\n             'writes': 21,\n             'prefers': 3,\n             'wearing': 88,\n             'showing': 105,\n             'face': 1126,\n             'comes': 92,\n             'mistaken': 60,\n             'resolve': 15,\n             'doubts': 40,\n             'sharp': 84,\n             'sound': 220,\n             'horses': 263,\n             'hoofs': 25,\n             'grating': 11,\n             'wheels': 48,\n             'curb': 5,\n             'followed': 330,\n             'pull': 24,\n             'whistled': 14,\n             'pair': 41,\n             'yes': 689,\n             'continued': 292,\n             'glancing': 99,\n             'window': 187,\n             'nice': 54,\n             'brougham': 5,\n             'beauties': 3,\n             'hundred': 230,\n             'fifty': 95,\n             'guineas': 4,\n             'apiece': 8,\n             'money': 327,\n             'nothing': 647,\n             'else': 202,\n             'better': 267,\n             'bit': 64,\n             'doctor': 184,\n             'stay': 75,\n             'lost': 225,\n             'boswell': 2,\n             'promises': 16,\n             'interesting': 72,\n             'pity': 76,\n             'miss': 113,\n             'client': 34,\n             'want': 324,\n             'sit': 90,\n             'best': 269,\n             'slow': 66,\n             'step': 140,\n 
            'stairs': 32,\n             'passage': 111,\n             'paused': 80,\n             'immediately': 183,\n             'outside': 111,\n             'loud': 65,\n             'authoritative': 3,\n             'tap': 11,\n             'come': 935,\n             'entered': 283,\n             'less': 368,\n             'feet': 180,\n             'inches': 17,\n             'height': 37,\n             'limbs': 68,\n             'hercules': 5,\n             'dress': 139,\n             'rich': 93,\n             'richness': 3,\n             'england': 312,\n             'bad': 156,\n             'taste': 24,\n             'bands': 28,\n             'astrakhan': 2,\n             'slashed': 4,\n             'sleeves': 31,\n             'fronts': 2,\n             'breasted': 2,\n             'coat': 173,\n             'deep': 216,\n             'cloak': 63,\n             'thrown': 93,\n             'shoulders': 126,\n             'lined': 33,\n             'flame': 16,\n             'coloured': 22,\n             'silk': 51,\n             'secured': 49,\n             'neck': 204,\n             'brooch': 2,\n             'consisted': 39,\n             'single': 174,\n             'flaming': 9,\n             'boots': 92,\n             'extended': 76,\n             'halfway': 20,\n             'calves': 4,\n             'trimmed': 9,\n             'tops': 4,\n             'fur': 39,\n             'completed': 26,\n             'impression': 68,\n             'barbaric': 3,\n             'opulence': 4,\n             'suggested': 70,\n             'appearance': 136,\n             'carried': 283,\n             'broad': 93,\n             'brimmed': 5,\n             'hand': 835,\n             'wore': 59,\n             'upper': 131,\n             'extending': 36,\n             'past': 224,\n             'cheekbones': 5,\n             'vizard': 2,\n             'apparently': 69,\n             'raised': 213,\n             'lower': 197,\n             'appeared': 198,\n       
      'hanging': 43,\n             'straight': 125,\n             'chin': 31,\n             'suggestive': 12,\n             'resolution': 58,\n             'pushed': 82,\n             'length': 64,\n             'obstinacy': 8,\n             'harsh': 23,\n             'voice': 463,\n             'strongly': 42,\n             'marked': 139,\n             'accent': 19,\n             'uncertain': 31,\n             'pray': 80,\n             'seat': 171,\n             'colleague': 8,\n             'dr': 49,\n             'occasionally': 90,\n             'cases': 454,\n             'whom': 490,\n             'honour': 17,\n             'count': 749,\n             'von': 12,\n             'kramm': 3,\n             'nobleman': 12,\n             'understand': 413,\n             'discretion': 14,\n             'trust': 69,\n             'extreme': 73,\n             'prefer': 22,\n             'communicate': 16,\n             'alone': 338,\n             'rose': 244,\n             'caught': 91,\n             'wrist': 69,\n             'back': 747,\n             'chair': 136,\n             'none': 111,\n             'say': 756,\n             'anything': 380,\n             'shrugged': 36,\n             'begin': 98,\n             'binding': 19,\n             'absolute': 57,\n             'secrecy': 19,\n             'years': 572,\n             'end': 466,\n             'present': 330,\n             'weight': 71,\n             'influence': 139,\n             'european': 100,\n             'history': 440,\n             'promise': 68,\n             'excuse': 54,\n             'strange': 221,\n             'august': 71,\n             'person': 186,\n             'employs': 3,\n             'wishes': 43,\n             'agent': 26,\n             'unknown': 88,\n             'confess': 37,\n             'once': 570,\n             'called': 451,\n             'exactly': 48,\n             'aware': 53,\n             'dryly': 6,\n             'circumstances': 108,\n             'delicacy': 
12,\n             'precaution': 10,\n             'taken': 439,\n             'quench': 4,\n             'grow': 75,\n             'seriously': 64,\n             'compromise': 72,\n             'families': 46,\n             'speak': 256,\n             'plainly': 40,\n             'implicates': 6,\n             'house': 662,\n             'ormstein': 3,\n             'hereditary': 15,\n             'kings': 28,\n             'murmured': 19,\n             'settling': 17,\n             'closing': 36,\n             'glanced': 177,\n             ...})"
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "NWORDS  # inspect the word counts built from the corpus\n"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% Link: https://pan.baidu.com/s/1tSGb4Nw86OHpatqYEDXbWg  Extraction code: 2q77\n"
    }
   }
  },
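  {
   "cell_type": "markdown",
   "source": [
    "### Edit Distance\n",
    "\n",
    "The edit distance between two words is the minimum number of insertions, deletions, transpositions (swaps of two adjacent letters), and substitutions needed to turn one word into the other. For example, `helo` becomes `hello` with one insertion, so the two are at edit distance 1."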
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "outputs": [],
   "source": [
    "alphabet = 'abcdefghijklmnopqrstuvwxyz'  # letters used for substitutions and insertions in edits1\n",
    "\n",
    "# Return the set of all strings at edit distance 1 from word\n",
    "def edits1(word):\n",
    "    n = len(word)\n",
    "    # for i in range(n):\n",
    "    #     print(i)\n",
    "    #     print(set(word[0:i] for i in range(n)))  # prefixes, e.g. '', 'a', 'ap', 'app', in arbitrary order; set() builds an unordered collection of unique elements that supports membership tests, deduplication, intersection, difference, and union\n",
    "    #     print(set(word[i + 1:] for i in range(n)))  # suffixes, e.g. 'ppl', 'pl', 'l', '', also in arbitrary order\n",
    "    # print(set([word[0:i] + word[i + 1:] for i in range(n)]))\n",
    "    # print(set([word[0:i] + word[i + 1] + word[i] + word[i + 2:] for i in range(n - 1)]))\n",
    "    # print(set([word[0:i] + c + word[i + 1:] for i in range(n) for c in alphabet]))\n",
    "    # print(set([word[0:i] + c + word[i:] for i in range(n + 1) for c in alphabet]))\n",
    "\n",
    "    return set([word[0:i] + word[i + 1:] for i in range(n)] +  # deletion\n",
    "               [word[0:i] + word[i + 1] + word[i] + word[i + 2:] for i in range(n - 1)] +  # transposition\n",
    "               [word[0:i] + c + word[i + 1:] for i in range(n) for c in alphabet] +  # alteration (substitution)\n",
    "               [word[0:i] + c + word[i:] for i in range(n + 1) for c in alphabet])  # insertion"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
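  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "As a rough size check on `edits1` (a back-of-the-envelope count, taken before `set` removes duplicates), for a word of length $n$ and the 26-letter `alphabet` the four list comprehensions produce:\n",
    "\n",
    "$$\n",
    "\\underbrace{n}_{\\text{deletions}} + \\underbrace{n-1}_{\\text{transpositions}} + \\underbrace{26n}_{\\text{alterations}} + \\underbrace{26(n+1)}_{\\text{insertions}} = 54n + 25\n",
    "$$\n",
    "\n",
    "For a five-letter word such as `hello` this gives $54 \\cdot 5 + 25 = 295$ strings. The returned set is smaller because some edits coincide, and it can contain the original word itself (for example, when a letter is replaced by the same letter)."
   ]
  },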
  {
   "cell_type": "code",
   "execution_count": 25,
   "outputs": [
    {
     "data": {
      "text/plain": "{'aello',\n 'ahello',\n 'bello',\n 'bhello',\n 'cello',\n 'chello',\n 'dello',\n 'dhello',\n 'eello',\n 'ehello',\n 'ehllo',\n 'ello',\n 'fello',\n 'fhello',\n 'gello',\n 'ghello',\n 'haello',\n 'hallo',\n 'hbello',\n 'hbllo',\n 'hcello',\n 'hcllo',\n 'hdello',\n 'hdllo',\n 'heallo',\n 'healo',\n 'hebllo',\n 'heblo',\n 'hecllo',\n 'heclo',\n 'hedllo',\n 'hedlo',\n 'heello',\n 'heelo',\n 'hefllo',\n 'heflo',\n 'hegllo',\n 'heglo',\n 'hehllo',\n 'hehlo',\n 'heillo',\n 'heilo',\n 'hejllo',\n 'hejlo',\n 'hekllo',\n 'heklo',\n 'helalo',\n 'helao',\n 'helblo',\n 'helbo',\n 'helclo',\n 'helco',\n 'heldlo',\n 'heldo',\n 'helelo',\n 'heleo',\n 'helflo',\n 'helfo',\n 'helglo',\n 'helgo',\n 'helhlo',\n 'helho',\n 'helilo',\n 'helio',\n 'heljlo',\n 'heljo',\n 'helklo',\n 'helko',\n 'hell',\n 'hella',\n 'hellao',\n 'hellb',\n 'hellbo',\n 'hellc',\n 'hellco',\n 'helld',\n 'helldo',\n 'helle',\n 'helleo',\n 'hellf',\n 'hellfo',\n 'hellg',\n 'hellgo',\n 'hellh',\n 'hellho',\n 'helli',\n 'hellio',\n 'hellj',\n 'helljo',\n 'hellk',\n 'hellko',\n 'helll',\n 'helllo',\n 'hellm',\n 'hellmo',\n 'helln',\n 'hellno',\n 'hello',\n 'helloa',\n 'hellob',\n 'helloc',\n 'hellod',\n 'helloe',\n 'hellof',\n 'hellog',\n 'helloh',\n 'helloi',\n 'helloj',\n 'hellok',\n 'hellol',\n 'hellom',\n 'hellon',\n 'helloo',\n 'hellop',\n 'helloq',\n 'hellor',\n 'hellos',\n 'hellot',\n 'hellou',\n 'hellov',\n 'hellow',\n 'hellox',\n 'helloy',\n 'helloz',\n 'hellp',\n 'hellpo',\n 'hellq',\n 'hellqo',\n 'hellr',\n 'hellro',\n 'hells',\n 'hellso',\n 'hellt',\n 'hellto',\n 'hellu',\n 'helluo',\n 'hellv',\n 'hellvo',\n 'hellw',\n 'hellwo',\n 'hellx',\n 'hellxo',\n 'helly',\n 'hellyo',\n 'hellz',\n 'hellzo',\n 'helmlo',\n 'helmo',\n 'helnlo',\n 'helno',\n 'helo',\n 'helol',\n 'helolo',\n 'heloo',\n 'helplo',\n 'helpo',\n 'helqlo',\n 'helqo',\n 'helrlo',\n 'helro',\n 'helslo',\n 'helso',\n 'heltlo',\n 'helto',\n 'helulo',\n 'heluo',\n 'helvlo',\n 'helvo',\n 'helwlo',\n 'helwo',\n 'helxlo',\n 
'helxo',\n 'helylo',\n 'helyo',\n 'helzlo',\n 'helzo',\n 'hemllo',\n 'hemlo',\n 'henllo',\n 'henlo',\n 'heollo',\n 'heolo',\n 'hepllo',\n 'heplo',\n 'heqllo',\n 'heqlo',\n 'herllo',\n 'herlo',\n 'hesllo',\n 'heslo',\n 'hetllo',\n 'hetlo',\n 'heullo',\n 'heulo',\n 'hevllo',\n 'hevlo',\n 'hewllo',\n 'hewlo',\n 'hexllo',\n 'hexlo',\n 'heyllo',\n 'heylo',\n 'hezllo',\n 'hezlo',\n 'hfello',\n 'hfllo',\n 'hgello',\n 'hgllo',\n 'hhello',\n 'hhllo',\n 'hiello',\n 'hillo',\n 'hjello',\n 'hjllo',\n 'hkello',\n 'hkllo',\n 'hlello',\n 'hlelo',\n 'hlllo',\n 'hllo',\n 'hmello',\n 'hmllo',\n 'hnello',\n 'hnllo',\n 'hoello',\n 'hollo',\n 'hpello',\n 'hpllo',\n 'hqello',\n 'hqllo',\n 'hrello',\n 'hrllo',\n 'hsello',\n 'hsllo',\n 'htello',\n 'htllo',\n 'huello',\n 'hullo',\n 'hvello',\n 'hvllo',\n 'hwello',\n 'hwllo',\n 'hxello',\n 'hxllo',\n 'hyello',\n 'hyllo',\n 'hzello',\n 'hzllo',\n 'iello',\n 'ihello',\n 'jello',\n 'jhello',\n 'kello',\n 'khello',\n 'lello',\n 'lhello',\n 'mello',\n 'mhello',\n 'nello',\n 'nhello',\n 'oello',\n 'ohello',\n 'pello',\n 'phello',\n 'qello',\n 'qhello',\n 'rello',\n 'rhello',\n 'sello',\n 'shello',\n 'tello',\n 'thello',\n 'uello',\n 'uhello',\n 'vello',\n 'vhello',\n 'wello',\n 'whello',\n 'xello',\n 'xhello',\n 'yello',\n 'yhello',\n 'zello',\n 'zhello'}"
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "edits1('hello')"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "outputs": [],
   "source": [
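    "# Return the corpus words reachable from `word` by applying two single-character edits in a row.\n",
    "# This also covers words at distance 0 or 1, because an alteration may replace a letter with itself.\n",
    "def known_edits2(word):\n",
    "    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)"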
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "outputs": [
    {
     "data": {
      "text/plain": "{'bell',\n 'belle',\n 'bells',\n 'belly',\n 'cell',\n 'cells',\n 'fell',\n 'fellow',\n 'felo',\n 'hall',\n 'hallo',\n 'halls',\n 'halo',\n 'heal',\n 'heals',\n 'heel',\n 'heels',\n 'held',\n 'helen',\n 'hell',\n 'hello',\n 'helm',\n 'help',\n 'helps',\n 'hero',\n 'hill',\n 'hills',\n 'hilly',\n 'hollow',\n 'holly',\n 'hull',\n 'hullo',\n 'jealo',\n 'jelly',\n 'mellow',\n 'sell',\n 'selle',\n 'shell',\n 'shells',\n 'tell',\n 'tells',\n 'telly',\n 'vell',\n 'well',\n 'wells',\n 'yell',\n 'yellow',\n 'yells'}"
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "known_edits2('hello')"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "outputs": [],
   "source": [
    "# Keep only the candidate words that appear in the corpus vocabulary NWORDS\n",
    "def known(words):\n",
    "    return set(w for w in words if w in NWORDS)\n"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "outputs": [],
   "source": [
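    "def correct(word):\n",
    "    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]\n",
    "    # `or` short-circuits: the first non-empty set becomes `candidates` and the later ones are never computed.\n",
    "    # This ordering stands in for the error model P(w|c): known words at edit distance 0 are preferred, then distance 1,\n",
    "    # then distance 2, and finally the input itself if nothing else is known.\n",
    "    # P(w|c) is the probability of typing w when c was intended; it is assumed to fall off sharply with edit distance,\n",
    "    # so a candidate at a smaller distance always beats one at a larger distance.\n",
    "    print(candidates)  # the candidate corrections under consideration\n",
    "    return max(candidates, key=lambda w: NWORDS.get(w, 1))\n",
    "    # Among candidates at the same distance, return the one with the highest corpus count, i.e. the largest prior P(c).\n",
    "    # .get(w, 1) reads the count without inserting an unknown word into NWORDS (NWORDS[w] would add it with count 1).\n",
    "    # Together, the distance ordering (likelihood) and the frequency ranking (prior) approximate argmax P(w|c)P(c)."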
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'apply', 'apple'}\n",
      "apply\n"
     ]
    }
   ],
   "source": [
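    "# print(edits1('appl'))\n",
    "# print(known_edits2('appl'))\n",
    "# print(known(['appl']))  # known() expects an iterable of words, not a bare string\n",
    "print(correct('appl'))  # prints the candidate set, then the chosen correction: apply\n"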
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}